AITopics | agent evaluation

agent evaluation

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

Shea, Ryan, Lu, Yunan, Qiu, Liang, Yu, Zhou

arXiv.org Artificial Intelligence, Oct-15-2025

Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative; however, existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users' information needs and expectations in a company's target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.

large language model, machine learning, natural language

arXiv:2510.11997

Country:

Asia (1.00)
North America > United States (0.28)
Europe > Austria (0.28)

Genre: Research Report (1.00)

Industry:

Consumer Products & Services (1.00)
Information Technology > Security & Privacy (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)

Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey

Zhu, Jiachen, Zhu, Menghui, Rui, Renting, Shan, Rong, Zheng, Congmin, Chen, Bo, Xi, Yunjia, Lin, Jianghao, Liu, Weiwen, Tang, Ruiming, Yu, Yong, Zhang, Weinan

arXiv.org Artificial Intelligence, Jun-16-2025

The advent of large language models (LLMs), such as GPT, Gemini, and DeepSeek, has significantly advanced natural language processing, giving rise to sophisticated chatbots capable of diverse language-related tasks. The transition from these traditional LLM chatbots to more advanced AI agents represents a pivotal evolutionary step. However, existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks. To bridge this gap, this paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective. We provide a detailed analytical framework that clearly differentiates AI agents from LLM chatbots along five key aspects: complex environment, multi-source instructor, dynamic feedback, multi-modal perception, and advanced capability. Further, we categorize existing evaluation benchmarks based on the driving forces of external environments and the resulting advanced internal capabilities. For each category, we delineate relevant evaluation attributes, presented comprehensively in practical reference tables. Finally, we synthesize current trends and outline future evaluation methodologies through four critical lenses: environment, agent, evaluator, and metrics. Our findings offer actionable guidance for researchers, facilitating the informed selection and application of benchmarks in AI agent evaluation, thus fostering continued advancement in this rapidly evolving research domain.

large language model, machine learning, natural language

arXiv:2506.11102

Genre:

Overview (1.00)
Workflow (0.69)
Research Report > New Finding (0.34)

Industry:

Education (1.00)
Media (0.92)
Leisure & Entertainment > Games > Computer Games (0.68)
Information Technology > Software (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Reinforcement Learning for Stock Transactions

Zhou, Ziyi, Stern, Nicholas, Laasri, Julien

arXiv.org Artificial Intelligence, May-27-2025

Much research has been done to analyze the stock market. After all, if one can determine a pattern in the chaotic frenzy of transactions, then they could make a hefty profit from capitalizing on these insights. As such, the goal of our project was to apply reinforcement learning (RL) to determine the best time to buy a stock within a given time frame. With only a few adjustments, our model can be extended to identify the best time to sell a stock as well. In order to use the format of free, real-world data to train the model, we define our own Markov Decision Process (MDP) problem. These two papers [5] [6] helped us in formulating the state space and the reward system of our MDP problem. We train a series of agents using Q-Learning, Q-Learning with linear function approximation, and deep Q-Learning. In addition, we try to predict the stock prices using machine learning regression and classification models. We then compare our agents to see if they converge on a policy, and if so, which one learned the best policy to maximize profit on the stock market.
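
The abstract names tabular Q-Learning as one of the trained agents but does not give the MDP. As a minimal sketch only, the standard Q-Learning update can be written as below; the state encoding, the "buy"/"wait" action names, the reward values, and the function name q_update are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular Q-Learning update and return the new Q-value.

    q is a mapping from (state, action) to an estimated return.
    """
    # Value of the best action in the next state (zero if the episode ended).
    best_next = 0.0 if terminal else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

# Hypothetical usage: states are day indices, actions are "buy" or "wait".
actions = ["buy", "wait"]
q = defaultdict(float)
q_update(q, state=0, action="buy", reward=1.0, next_state=1,
         actions=actions, alpha=0.5, gamma=0.9, terminal=True)
```

Q-Learning with linear function approximation and deep Q-Learning, also mentioned in the abstract, replace the table q with a parameterized function but keep the same target.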

artificial intelligence, machine learning, reinforcement learning

arXiv:2505.16099

Genre: Research Report (0.50)

Industry: Banking & Finance > Trading (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

AI Agents That Matter

Kapoor, Sayash, Stroebl, Benedikt, Siegel, Zachary S., Nadgir, Nitya, Narayanan, Arvind

arXiv.org Artificial Intelligence, Jul-1-2024

AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that hinder their usefulness in real-world applications. First, there is a narrow focus on accuracy without attention to other metrics. As a result, SOTA agents are needlessly complex and costly, and the community has reached mistaken conclusions about the sources of accuracy gains. Our focus on cost in addition to accuracy motivates the new goal of jointly optimizing the two metrics. We design and implement one such optimization, showing its potential to greatly reduce cost while maintaining accuracy. Second, the benchmarking needs of model and downstream developers have been conflated, making it hard to identify which agent would be best suited for a particular application. Third, many agent benchmarks have inadequate holdout sets, and sometimes none at all. This has led to agents that are fragile because they take shortcuts and overfit to the benchmark in various ways. We prescribe a principled framework for avoiding overfitting. Finally, there is a lack of standardization in evaluation practices, leading to a pervasive lack of reproducibility. We hope that the steps we introduce for addressing these shortcomings will spur the development of agents that are useful in the real world and not just accurate on benchmarks.
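
One simple way to look at cost and accuracy together, rather than accuracy alone, is to keep only the agents that no other agent beats on both metrics (a Pareto frontier). The sketch below illustrates that idea under stated assumptions; it is not the optimization the authors implement, and the agent names and numbers are invented for illustration.

```python
def pareto_frontier(agents):
    """Return names of agents not dominated on (cost, accuracy).

    agents is a list of (name, cost, accuracy) tuples. An agent is
    dominated if another agent is at least as cheap and at least as
    accurate, and strictly better on one of the two.
    """
    frontier = []
    for name, cost, acc in agents:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for _, c, a in agents
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical agents: (name, cost per task in dollars, accuracy).
agents = [
    ("simple", 1.0, 0.80),
    ("complex", 10.0, 0.80),  # same accuracy as "simple" at 10x the cost
    ("mid", 3.0, 0.85),
    ("weak", 2.0, 0.70),
]
print(pareto_frontier(agents))
```

In this toy data, "complex" drops out because "simple" matches its accuracy at lower cost, which is the kind of comparison an accuracy-only leaderboard would hide.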

agent, benchmark, evaluation

arXiv:2407.01502

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
Asia > Middle East > Jordan (0.04)
Africa > Eswatini > Manzini > Manzini (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Education (0.67)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

The Baseline Approach to Agent Evaluation

Davidson, Josh (University of Alberta) | Archibald, Christopher (University of Alberta) | Bowling, Michael (University of Alberta)

AAAI Conferences, Jul-9-2013

An important aspect of agent evaluation in stochastic games, especially poker, is the need to reduce the outcome variance in order to get accurate and significant results. The current method used in the Annual Computer Poker Competition’s analysis is that of duplicate poker, an approach that leverages the ability to deal sets of cards to agents in order to reduce variance. This work explores a different approach to variance reduction by using a control-variate-based approach known as baseline. The baseline approach involves using an agent’s outcome in self play to create an unbiased estimator for use in agent evaluation and has been shown to work well in both poker and trading agent competition domains. Baseline does not require that the agents are able to be dealt sets of cards, making it a more robust technique than duplicate. This approach is compared to the current duplicate method, as well as other variations of duplicate poker on the results of the 2011 two-player no-limit and three-player limit Texas Hold’em ACPC tournaments.
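
The variance-reduction idea can be illustrated with a toy control variate. This is a sketch under strong assumptions, not the paper's estimator: each hand's outcome is modeled as a small skill edge plus shared card luck plus noise, the self-play baseline sees the same card luck and is assumed to have zero expected value, and the control-variate coefficient is fixed at 1. The function names and all numeric parameters are hypothetical.

```python
import random
import statistics

def simulate_hand(rng, skill_edge=0.1, luck_sd=10.0, noise_sd=1.0):
    """Return (observed outcome, self-play baseline) for one hand."""
    luck = rng.gauss(0.0, luck_sd)  # card luck shared by both outcomes
    observed = skill_edge + luck + rng.gauss(0.0, noise_sd)
    baseline = luck + rng.gauss(0.0, noise_sd)  # zero mean by assumption
    return observed, baseline

def evaluate(n_hands=2000, seed=0):
    """Return per-hand raw outcomes and baseline-corrected outcomes."""
    rng = random.Random(seed)
    pairs = [simulate_hand(rng) for _ in range(n_hands)]
    raw = [x for x, _ in pairs]
    # Subtracting a zero-mean baseline keeps the estimate unbiased
    # while cancelling the shared luck term.
    corrected = [x - b for x, b in pairs]
    return raw, corrected

raw, corrected = evaluate()
print("raw variance:      ", statistics.variance(raw))
print("corrected variance:", statistics.variance(corrected))
```

Because the baseline is correlated with the observed outcome through the shared luck term, the corrected per-hand values vary far less than the raw ones, so fewer hands are needed to detect the same skill edge.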

agent evaluation, baseline approach

Workshops at the Twenty-Seventh AAAI Conference on Artificial Intelligence

Country: North America > United States > Texas (0.24)

Genre: Contests & Prizes (0.53)

Industry: Leisure & Entertainment > Games > Poker (0.53)

Technology: Information Technology > Artificial Intelligence > Games > Poker (0.53)
